Project Overview
Current project tree
.
├── LICENSE
├── README.md
├── Rplots.pdf
├── cicd.png
├── config
│  ├── config.yaml
│  ├── samples.tsv
│  └── units.tsv
├── dags
│  ├── rulegraph.png
│  └── rulegraph.svg
├── data
│  └── metadata
├── images
│  ├── PRJEB21612_gps.png
│  ├── PRJNA477349_gps.png
│  ├── PRJNA477349_variable_freq.png
│  ├── PRJNA477349_variable_freq.svg
│  ├── PRJNA685168_variable_freq.png
│  ├── PRJNA685168_variable_freq.svg
│  ├── PRJNA802976_gps.png
│  ├── bkgd.png
│  ├── gpsfiles
│  ├── imap.png
│  ├── metadata.png
│  ├── smkreport
│  └── sra_run_selector.png
├── imap-sample-metadata.Rproj
├── index.Rmd
├── library
│  ├── apa.csl
│  ├── imap.bib
│  └── references.bib
├── report.html
├── resources
├── results
│  ├── PRJEB21612_read_size_asc.csv
│  ├── PRJEB21612_read_size_desc.csv
│  ├── PRJEB21612_sra_accessions.txt
│  ├── PRJNA477349_read_size_asc.csv
│  ├── PRJNA477349_read_size_desc.csv
│  ├── PRJNA477349_sra_accessions.txt
│  ├── PRJNA685168_read_size_asc.csv
│  ├── PRJNA685168_read_size_desc.csv
│  ├── PRJNA685168_sra_accessions.txt
│  ├── PRJNA802976_read_size_asc.csv
│  ├── PRJNA802976_read_size_desc.csv
│  ├── PRJNA802976_sra_accessions.txt
│  └── project_tree.txt
├── styles.css
└── workflow
   ├── Snakefile
   ├── envs
   ├── reports
   ├── rules
   ├── schemas
   └── scripts
16 directories, 41 files
Current snakemake workflow
General overview
What is metadata?
- Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
- Sample metadata described in this book refers to the description and context of the individual sample collected for a specific microbiome study.
Metadata structure
- Metadata collected at different stages is typically organized in an Excel or Google spreadsheet where:
- The metadata table columns represent the properties of the samples.
- The table rows contain information associated with the samples.
- Typically, the first column of sample metadata is the Sample ID, which serves as the key for each individual sample.
- The Sample ID must be unique.
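The uniqueness requirement can be checked before any downstream analysis. The following is a minimal sketch, assuming the metadata has been exported as CSV and that the key column is literally named "Sample ID"; both the column name and the toy table are illustrative assumptions, not taken from the project files.

```python
import csv
import io
from collections import Counter

def duplicate_sample_ids(csv_file, id_column="Sample ID"):
    """Return the sample IDs that occur more than once, in sorted order."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[id_column] for row in reader)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)

# Hypothetical table: rows are samples, columns are sample properties.
toy_table = "Sample ID,host,body_site\nS1,human,gut\nS2,human,gut\nS1,human,skin\n"
print(duplicate_sample_ids(io.StringIO(toy_table)))
```

An empty list means every Sample ID is unique; any ID that is returned needs to be resolved before the table is used as a mapping file.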
Embedded metadata
- In most cases, you will find the metadata detached from the experimental data.
- Embedded metadata, by contrast, is integrated into the experimental data itself, especially in the case of graphics.
- Major microbiome analysis platforms require sample metadata, commonly referred to as a mapping file, when performing downstream analysis.
Explore SRA metadata
Brief overview
Typically, after sequencing microbiome DNA, investigators are encouraged to deposit the sequence reads in a public repository. The Sequence Read Archive (SRA) is a widely used public database for sequence read information. One advantage of the SRA is that it integrates data from the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).
Downloading metadata via SRA Run Selector [#runselector]
Metadata associated with a specific project can be retrieved manually via the SRA Run Selector or using the Entrez Direct (edirect) scripts.
- Note that the SRA filename for metadata is automatically named SraRunTable.txt, but for clarity we will provide a filename corresponding to the NCBI-BioProject ID with a .CSV extension.
- We will save the metadata file in the data/metadata/ folder.
Let’s create the folder (if it doesn’t exist!).
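A minimal sketch of this step, assuming it is run from the project root. The helper names are hypothetical, and the lowercase .csv suffix is an assumption about how the downloaded file is renamed.

```python
from pathlib import Path

def ensure_metadata_dir(project_root="."):
    """Create <project_root>/data/metadata if it is missing and return its path."""
    metadata_dir = Path(project_root) / "data" / "metadata"
    # parents=True also creates data/; exist_ok=True makes this a no-op
    # when the folder already exists.
    metadata_dir.mkdir(parents=True, exist_ok=True)
    return metadata_dir

def metadata_file_for(bioproject_id, project_root="."):
    """Target path for the renamed Run Selector download (hypothetical helper)."""
    return ensure_metadata_dir(project_root) / f"{bioproject_id}.csv"
```

Calling metadata_file_for("PRJNA477349") creates the folder if needed and returns the path data/metadata/PRJNA477349.csv, to which the downloaded SraRunTable.txt can then be moved.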
For demonstration, we will explore sample metadata retrieved from four randomly selected microbiome BioProjects:
- PRJNA477349: 16S: rRNA from bushmeat samples collected from Tanzania Metagenome
- PRJNA802976: 16S: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants
- PRJNA685168: WGS: Multi-omics suggest diverse mechanisms for response to biologic therapies in IBD
- PRJEB21612: WGS: Alterations of the gut microbiome in hypertension
Example screenshot of the SRA Run Selector for metadata associated with the NCBI-SRA BioProject number PRJNA477349
How many rows and columns
A clear understanding of the variables in the sample metadata helps in selecting the most important features for downstream analysis.
[1] "There are 133 rows and 36 columns in PRJNA477349 metadata"
[1] "There are 54 rows and 35 columns in PRJNA802976 metadata"
[1] "There are 114 rows and 57 columns in PRJNA685168 metadata"
[1] "There are 117 rows and 49 columns in PRJEB21612 metadata"
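Dimensions like those reported above can be obtained by counting the header fields and the data rows of a metadata table. The sketch below uses only the Python standard library and a toy in-memory table with a placeholder project name; it illustrates the idea and is not the code that produced the output above.

```python
import csv
import io

def table_shape(csv_file):
    """Return (n_rows, n_columns) for a CSV file object with a header line."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Count data rows only, skipping completely blank lines.
    n_rows = sum(1 for row in reader if row)
    return n_rows, len(header)

def describe(bioproject_id, csv_file):
    """Format the shape of one metadata table as a sentence."""
    n_rows, n_cols = table_shape(csv_file)
    return f"There are {n_rows} rows and {n_cols} columns in {bioproject_id} metadata"

# Toy table standing in for a downloaded metadata file.
toy = "Run,BioProject,Host\nR1,DEMO,human\nR2,DEMO,human\n"
print(describe("DEMO", io.StringIO(toy)))
```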
Downloading the Entrez SRA runinfo
- Entrez Direct returns a uniform set of 47 columns for each BioProject.
- Below is a list of associated columns.
[1] "Run" "ReleaseDate" "LoadDate"
[4] "spots" "bases" "spots_with_mates"
[7] "avgLength" "size_MB" "AssemblyName"
[10] "download_path" "Experiment" "LibraryName"
[13] "LibraryStrategy" "LibrarySelection" "LibrarySource"
[16] "LibraryLayout" "InsertSize" "InsertDev"
[19] "Platform" "Model" "SRAStudy"
[22] "BioProject" "Study_Pubmed_id" "ProjectID"
[25] "Sample" "BioSample" "SampleType"
[28] "TaxID" "ScientificName" "SampleName"
[31] "g1k_pop_code" "source" "g1k_analysis_group"
[34] "Subject_ID" "Sex" "Disease"
[37] "Tumor" "Affection_Status" "Analyte_Type"
[40] "Histological_Type" "Body_Site" "CenterName"
[43] "Submission" "dbgap_study_accession" "Consent"
[46] "RunHash" "ReadHash"
Note: Full metadata, which is BioProject-specific, can be downloaded manually from the SRA database using the Run Selector option as described above [#runselector].
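Because the runinfo columns are the same for every BioProject, one small function can process any runinfo table. The sketch below sorts runs by the size_MB column and returns the Run accessions. The toy accessions are placeholders, and the sorting is only an assumed illustration of how size-ordered lists might be derived, not the workflow's actual rule.

```python
import csv
import io

def runs_by_size(runinfo_file, descending=False):
    """Return (Run, size_MB) pairs from a runinfo CSV, sorted by size."""
    reader = csv.DictReader(runinfo_file)
    # Skip rows without a Run accession (for example, stray blank records).
    pairs = [(row["Run"], float(row["size_MB"])) for row in reader if row["Run"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)

# Toy runinfo with placeholder accessions; real tables have all 47 columns.
toy_runinfo = "Run,size_MB\nRUN_A,120\nRUN_B,15\nRUN_C,48\n"
smallest_first = runs_by_size(io.StringIO(toy_runinfo))
print([run for run, _ in smallest_first])
```

Passing descending=True gives the largest runs first, which can be useful for estimating storage needs before downloading.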
Graphical exploration
Demo with PRJNA477349 metadata
The PRJNA477349 metadata contains latitude and longitude information, which enables dropping pins on the collection sites.
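Before coordinates can be plotted, they usually need to be converted to signed decimal degrees. The sketch below assumes the coordinates are stored as a single text value of the form "6.5 S 35.25 E", which is the usual BioSample convention. Whether this project's table uses exactly that form is an assumption, and the function name is hypothetical.

```python
import re

# Pattern for "<degrees> N|S <degrees> E|W", e.g. "6.5 S 35.25 E".
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)

def parse_lat_lon(value):
    """Convert a lat/lon string to (latitude, longitude), or None if unparseable."""
    match = _LAT_LON.match(value)
    if match is None:
        return None  # e.g. "not collected" or a missing value
    lat, north_south, lon, east_west = match.groups()
    latitude = float(lat) * (1 if north_south == "N" else -1)
    longitude = float(lon) * (1 if east_west == "E" else -1)
    return latitude, longitude

print(parse_lat_lon("6.5 S 35.25 E"))
print(parse_lat_lon("not collected"))
```

Southern latitudes and western longitudes come out negative, which is the form most mapping libraries expect.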
Frequency of variables
Sampling points
Demo with PRJNA685168 metadata
PRJNA685168 is an IBD study of responses to biologic therapies; its metadata contains sex and age features.
Frequency of variables
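The frequency of each value of a categorical variable can be tallied with a counter. The sketch below uses a toy table, and the column name "sex" is an assumption about how the feature is labelled in the metadata.

```python
import csv
import io
from collections import Counter

def value_frequencies(csv_file, column):
    """Count how often each value occurs in one column of a CSV table."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)

# Toy metadata; the real table has many more columns and rows.
toy_metadata = "Run,sex\nR1,female\nR2,male\nR3,female\n"
for value, count in value_frequencies(io.StringIO(toy_metadata), "sex").most_common():
    print(value, count)
```

The same function can be applied to any other categorical column, and the resulting counts can be passed directly to a bar plot.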
Demo with PRJNA802976 metadata
PRJNA802976 is a gut microbiota study of changes following systemic antibiotic administration in infants.
Sampling points
Demo with PRJEB21612 metadata
PRJEB21612 is a hypertension study of alterations of the gut microbiome.
Sampling points
References
Appendix
Static Snakemake report
The interactive Snakemake HTML report can be viewed by opening report.html in any compatible browser. You will be able to explore the workflow and the associated statistics. You can also collapse the left sidebar to get a wider view of the display.
Troubleshooting
- CiteprocXMLError: Missing root element
- The CSL file may be empty. Some examples of Citation Style Language files are available on GitHub[1].
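One quick way to check this diagnosis is to confirm that the CSL file is non-empty and parses as XML with a root element. The sketch below uses only the standard library. The function name is hypothetical, and it checks well-formedness only, not whether the style is valid CSL.

```python
import xml.etree.ElementTree as ET

def csl_root_tag(csl_bytes):
    """Return the XML root tag of a CSL document, or None if it is empty or malformed."""
    if not csl_bytes.strip():
        return None  # empty file: the likely cause of "Missing root element"
    try:
        return ET.fromstring(csl_bytes).tag
    except ET.ParseError:
        return None

# A minimal well-formed stand-in for a CSL file, and an empty one.
print(csl_root_tag(b'<style xmlns="http://purl.org/net/xbiblio/csl"/>'))
print(csl_root_tag(b""))
```

For a file on disk, such as library/apa.csl, the bytes can be read with pathlib.Path("library/apa.csl").read_bytes(). A result of None suggests the file should be downloaded again.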